library(leaflet)
library(sf)
library(sp)
library(splines)
library(tidyverse)
library(pubtheme)
library(readr)Pset 5 - Plotly, Shiny, Interactions, Splines - Part 1 and 2
Objectives
- Continued data exploration of US Census data and CT property values data using interactive visualization libraries like
leaflet,plotlyandshiny - Include Interaction terms in a linear model
- Include spline terms in a linear model
Choosing a home location
A prospective homeowner is interested in buying a house in Connecticut. Since she works in New York City near Grand Central Terminal and her partner works in New Haven, she has the following preferences for the location of her new home:
- between NYC and New Haven,
- relatively close to a Metro-North train station,
- special preference towards stations with many options for convenient trains traveling to Grand Central Station and New Haven.
She also prefers to live in a town where
- the house prices aren’t that high, and
- the population is relatively diverse, and where at least 5% of the population is Asian.
1. Visualization
Create an interactive visualization or visualizations that she can use to help her make her decision. Use the data below, which contains
- US Census data (same as what we previously used in class)
- Latitude/longitude for Metro-North stations.
- For every pair of stations, information about trips between those two stations: number of trips, the mean duration, and the minimum duration. This has been filtered so that only trips to/from Grand Central, to/from New Haven, and along the train lines that are in Connecticut (New Haven Line, Danbury Line, Waterbury Line, New Canaan Line) remain.
Here is the data with some light cleaning to help you get started.
Census data
dc = readRDS('data/tracts.and.census.with.EV.stations.rds')
dc = dc[dc$state == 'CT',]
head(dc@data,2) STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
12844 09 001 090300 1400000US09001090300 09001090300 903 CT
12845 09 001 090400 1400000US09001090400 09001090400 904 CT
ALAND AWATER meters miles state tract county state.full
12844 4764507 0 4764507 1.839586 CT 903 Fairfield County Connecticut
12845 7347827 0 7347827 2.837012 CT 904 Fairfield County Connecticut
pop male female age male.age female.age white black indian.alaskan
12844 4611 2333 2278 42.9 42.7 43.1 4230 24 0
12845 6518 3355 3163 40.6 36.5 45.7 4742 654 78
asian pacific other two.or.more white.not.hisp hisp white.hisp black.hisp
12844 180 0 66 111 3888 435 342 0
12845 524 24 88 408 4324 613 418 0
households i10orless i10to14 i15to19 i20to24 i25to29 i30to34 i35to39
12844 1550 56 8 22 34 16 12 31
12845 2140 24 18 85 0 89 36 49
i40to44 i45to49 i50to59 i60to74 i75to99 i100to124 i125to149 i150to199
12844 19 43 59 67 231 204 197 261
12845 0 94 24 128 219 343 233 421
i200ormore hh.income house.value male.p female.p white.p black.p
12844 290 118819 372100 50.59640 49.40360 91.73715 0.5204945
12845 377 121802 344600 51.47284 48.52716 72.75238 10.0337527
asian.p hisp.p white.not.hisp.p white.hisp.p black.hisp.p other.p
12844 3.903709 9.433962 84.32010 7.417046 0 3.838647
12845 8.039276 9.404725 66.33937 6.413010 0 9.174593
rescaled.house.value hh.income.and.house tot.hh.income tot.house.value
12844 80487.25 99653.12 184169450 576755000
12845 76759.97 99280.99 260656280 737444000
tot.hh.income.and.house pop.density hh.density income.density
12844 154462344 2506.542 842.5807 100114594
12845 212461309 2297.488 754.3148 91877050
house.value.density house.and.income.density lev2 lev3
12844 313524273 83965798 2 4
12845 259936875 74889116 NA NA
Train stations
st = read.csv('data/stops.txt')
st = st %>%
select(stop_id, stop_code, stop_name, stop_lat, stop_lon)
head(st) stop_id stop_code stop_name stop_lat stop_lon
1 1 0NY Grand Central 40.75300 -73.97706
2 4 0HL Harlem-125 St 40.80516 -73.93915
3 622 0YS Yankees-E 153 St 40.82530 -73.92990
4 184 0HI Highbridge Yard 40.83590 -73.93150
5 9 0MH Morris Heights 40.85425 -73.91958
6 10 0UH University Heights 40.86225 -73.91312
Trips
tr = readRDS('data/trips.rds')
tr = tr %>%
arrange(stop1, stop2)
head(tr, 10)
# Read the RDS file
data <- readRDS("data/trips.rds") # Change the filename accordingly
write_csv(data, "trips.csv") # Saves as CSV
readLines("data/stops.txt", n = 5) stop1 tf stop2 n mean min
1 Bethel from Grand Central 246 110.19919 104
2 Bethel to Grand Central 188 117.61702 112
3 Branchville from Grand Central 246 94.19919 88
4 Branchville to Grand Central 188 100.61702 95
5 Bridgeport from Grand Central 3024 90.20602 73
6 Bridgeport to Grand Central 2920 98.40651 77
7 Bridgeport to New Haven 3066 34.35486 25
8 Bridgeport from New Haven 2804 25.17511 20
9 Cannondale from Grand Central 246 87.19919 81
10 Cannondale to Grand Central 188 93.61702 88
[1] "stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,wheelchair_boarding"
[2] "1,0NY,Grand Central,,40.752998,-73.977056,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=1,0,,1"
[3] "4,0HL,Harlem-125 St,,40.805157,-73.939149,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=2,0,,1"
[4] "622,0YS,Yankees-E 153 St,,40.8253,-73.9299,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=3,0,,1"
[5] "184,0HI,Highbridge Yard,,40.8359,-73.9315,,http://as0.mta.info/mnr/stations/station_detail.cfm?key=,0,,0"
Visualizations
Create your visualization(s) here.
library(sf)
library(ggplot2)
library(dplyr)
library(ggpattern)
dg = st_as_sf(dc)
head(dg,2)
head(dg$geometry, 2)
mean_house_value <- mean(dg$house.value, na.rm = TRUE)
dg$diversity_color <- as.factor(ifelse(dg$asian.p > 5, "Diverse", "Not Diverse"))
dg$house_value_below_mean <- as.factor(ifelse(dg$house.value < mean_house_value, "Below Mean", "Above Mean"))
trips <- tr %>%
left_join(st, by = c("stop1" = "stop_name")) %>%
left_join(st, by = c("stop2" = "stop_name"), suffix = c(".1",".2")) %>%
filter(!is.na(stop_lat.1), !is.na(stop_lat.2), !is.na(stop_lon.2))Simple feature collection with 2 features and 73 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -73.24856 ymin: 41.22053 xmax: -73.17855 ymax: 41.25491
Geodetic CRS: NAD83
STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
12850 09 001 090300 1400000US09001090300 09001090300 903 CT
12851 09 001 090400 1400000US09001090400 09001090400 904 CT
ALAND AWATER meters miles state tract county state.full
12850 4764507 0 4764507 1.839586 CT 903 Fairfield County Connecticut
12851 7347827 0 7347827 2.837012 CT 904 Fairfield County Connecticut
pop male female age male.age female.age white black indian.alaskan
12850 4611 2333 2278 42.9 42.7 43.1 4230 24 0
12851 6518 3355 3163 40.6 36.5 45.7 4742 654 78
asian pacific other two.or.more white.not.hisp hisp white.hisp black.hisp
12850 180 0 66 111 3888 435 342 0
12851 524 24 88 408 4324 613 418 0
households i10orless i10to14 i15to19 i20to24 i25to29 i30to34 i35to39
12850 1550 56 8 22 34 16 12 31
12851 2140 24 18 85 0 89 36 49
i40to44 i45to49 i50to59 i60to74 i75to99 i100to124 i125to149 i150to199
12850 19 43 59 67 231 204 197 261
12851 0 94 24 128 219 343 233 421
i200ormore hh.income house.value male.p female.p white.p black.p
12850 290 118819 372100 50.59640 49.40360 91.73715 0.5204945
12851 377 121802 344600 51.47284 48.52716 72.75238 10.0337527
asian.p hisp.p white.not.hisp.p white.hisp.p black.hisp.p other.p
12850 3.903709 9.433962 84.32010 7.417046 0 3.838647
12851 8.039276 9.404725 66.33937 6.413010 0 9.174593
rescaled.house.value hh.income.and.house tot.hh.income tot.house.value
12850 80487.25 99653.12 184169450 576755000
12851 76759.97 99280.99 260656280 737444000
tot.hh.income.and.house pop.density hh.density income.density
12850 154462344 2506.542 842.5807 100114594
12851 212461309 2297.488 754.3148 91877050
house.value.density house.and.income.density lev2 lev3
12850 313524273 83965798 2 4
12851 259936875 74889116 NA NA
geometry
12850 MULTIPOLYGON (((-73.24539 4...
12851 MULTIPOLYGON (((-73.22186 4...
Geometry set for 2 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -73.24856 ymin: 41.22053 xmax: -73.17855 ymax: 41.25491
Geodetic CRS: NAD83
# Convert stops to sf (filter out missing coordinates)
stations_sf <- st %>%
filter(!is.na(stop_lon) & !is.na(stop_lat)) %>%
st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))
# Aggregate number of stops per region (filter out missing stops)
stop_counts <- tr %>%
count(stop1) %>%
rename(num_stops = n) %>%
left_join(st, by = c("stop1" = "stop_name")) %>%
filter(!is.na(stop_lon) & !is.na(stop_lat)) # Remove missing coordinates
stop_counts_sf <- stop_counts %>%
st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))
# Visualization
g <- ggplot() +
# Base layer: Fill color for Diversity, Border color for House Value
geom_sf(data = dg, aes(fill = diversity_color, color = house_value_below_mean), size = 0.5, show.legend = TRUE) +
# Overlay stops with size based on number of stops (Larger Dots = More Stops)
geom_sf(data = stop_counts_sf, aes(size = (1/num_stops)), color = "purple", alpha = 0.7) +
# Custom fill and color scales
scale_fill_manual(values = c("Diverse" = "blue", "Not Diverse" = "grey50"), name = "Diversity (Asian > 5%)") +
scale_color_manual(values = c("Below Mean" = "red", "Above Mean" = "black"), name = "House Value Stats") +
scale_size_continuous(range = c(4, 1), name = "Number of Stops") + # Now bigger dots = more stops
# Titles and theme
labs(title = "Public Transport, Diversity & Housing",
subtitle = "Number of stops, diversity, and housing value",
caption = "Purple Dots: Transit Stops (Bigger Dots = More Stops)") +
theme_minimal()
gMake it interactive:
library(leaflet)
library(dplyr)
library(sf)
# Convert stops data to sf for Leaflet
stations_sf <- st %>%
filter(!is.na(stop_lon) & !is.na(stop_lat)) %>%
st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))
# Aggregate number of stops per region (filter out missing stops)
stop_counts <- tr %>%
count(stop1) %>%
rename(num_stops = n) %>%
left_join(st, by = c("stop1" = "stop_name")) %>%
filter(!is.na(stop_lon) & !is.na(stop_lat)) # Remove missing coordinates
# Convert aggregated stops into sf
stop_counts_sf <- stop_counts %>%
st_as_sf(coords = c("stop_lon", "stop_lat"), crs = st_crs(dg))
# Create the interactive Leaflet map
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>% # Light background map
# Add town boundaries with diversity and house value color codes
addPolygons(data = dg,
fillColor = ~ifelse(diversity_color == "Diverse", "blue", "grey"),
color = ~ifelse(house_value_below_mean == "Below Mean", "red", "black"),
weight = 1, fillOpacity = 0.5,
popup = ~paste0("<b>Town:</b> ", NAME, "<br>",
"<b>Diversity (Asian >5%):</b> ", round(asian.p, 2), "%<br>",
"<b>House Value:</b> ", ifelse(house_value_below_mean == "Below Mean", "Below Mean", "Above Mean"))) %>%
# Add train stations with size based on number of stops
addCircleMarkers(data = stop_counts_sf,
radius = ~sqrt(num_stops) * 3, # Bigger dots for more stops
color = "purple", fillOpacity = 0.7,
popup = ~paste0("<b>Station:</b> ", stop1, "<br>",
"<b>Number of Stops:</b> ", num_stops)) %>%
# Add legend for Diversity
addLegend(position = "topright",
colors = c("blue", "grey"),
labels = c("Diverse (Asian >5%)", "Not Diverse"),
title = "Diversity") %>%
# Add legend for House Prices
addLegend(position = "bottomright",
colors = c("red", "black"),
labels = c("Below Mean", "Above Mean"),
title = "House Value Stats")2. Explanation
Briefly explain the choices you made for your visualization(s). What do you think the prospective homeowner will find most useful? Are there any locations in particular that stand out?
I chose to show the homeowner’s three main priorities: diversity, number of stops, and house value. In depicting these three choices with different mechanisms— the outline, size of purple dots, and filled in color, I believe this is an effective way to communicate a lot of information on one graph. I believe the houseowner will find this useful. The locations that first stand out are in the very center or slightly north of the state, as the house value stats are cheaper and the diversity is higher, however they have no stops that connect directly to grand central. So the options further south-west, at the tail end of the purple dots seem to be the best options for the homeowner.
Shiny
3. Basic Shiny App
Do the following to create a basic shiny app and publish it to shinyapps.io
- Configure R Studio to communicate with shinyapps.io by following the directions from this page https://shiny.rstudio.com/articles/shinyapps.html up to and including the “Method 1” section. You will show you how to install
rsconnect, create a shinyapps.io account, and set up communication between R Studio and shinyapps.io. Note that you will only have to do this once. - Create a new shiny app like we did in class by clicking File, New File, Shiny Web App. Name the app
pset5app1, choose Single File, choose an appropriate directory, and click Create. Anapp.Rfile will open in R Studio. - Click Run App to see the app. The app should appear in the Viewer tab in R Studio, or in its own window.
- In the upper right of the app, click Publish. Make sure the correct files and account are selected. Click Publish in the lower right.
Paste the url for your app here. The url will be of the form https://yourusername.shinyapps.io/pset5app1/
url here
https://pearlmallick.shinyapps.io/pset5app1/
id name
1 14025584 pset5app1
2 14026173 pset5_app2
3 14026338 pset5-property-values-interactions-splines-shiny-plotly
4 14030700 app2
url
1 https://pearlmallick.shinyapps.io/pset5app1/
2 https://pearlmallick.shinyapps.io/pset5_app2/
3 https://pearlmallick.shinyapps.io/pset5-property-values-interactions-splines-shiny-plotly/
4 https://pearlmallick.shinyapps.io/app2/
status created_time updated_time size instances guid title
1 sleeping 2025-02-18T02:44:12 2025-02-18T15:01:54 large NA NA <NA>
2 terminated 2025-02-18T04:18:40 2025-02-18T15:59:12 large NA NA <NA>
3 sleeping 2025-02-18T04:49:19 2025-02-18T15:42:55 large NA NA <NA>
4 running 2025-02-18T16:08:22 2025-02-18T16:18:41 large NA NA <NA>
config_url
1 https://www.shinyapps.io/admin/#/application/14025584
2 https://www.shinyapps.io/admin/#/application/14026173
3 https://www.shinyapps.io/admin/#/application/14026338
4 https://www.shinyapps.io/admin/#/application/14030700
4. Shiny App for choosing home location
Create a Shiny app that contains visualization you made in #1, and add at least one widget that allow the user to investigate the data by customizing the visualization in some useful way. For example, you could add a sliderInput or selectInput that helps you filter the data, or a selectInput that allows the user to choose how to color the points, etc.
Different types of user inputs can be found here https://shiny.rstudio.com/gallery/widget-gallery.html
Publish your shiny app to shinyapps.io, and paste the url for your app here:
url here
https://pearlmallick.shinyapps.io/app2/
Summarize in a sentence why you choose those user inputs and what you learned about the data from the app.
I added a slider input to filter towns by house value, a checkbox to show only towns with greater than 5% asian, and a dropdown (select input) to choose a Metro-North station. These inputs allow users to interactively explore affordability, diversity, and public transit accessibility, helping them make informed decisions about where to live. I learned from this data very quickly and easily that there are a few places, in the very middle of CT that would work perfectly for the homeowner, by using the slider and selecting the criteria.
CT Property data
First let’s load the CT property and keep only the New Haven properties.
d = readRDS('data/coast.properties.rds')
d = st_drop_geometry(d)
## I'm gonna change some column names
colnames(d) = gsub('_', '.', colnames(d))
colnames(d)[colnames(d) == 'Assessed.Total'] = 'value'
colnames(d)[colnames(d) == 'Number.of.Bedroom'] = 'beds'
colnames(d)[colnames(d) == 'Number.of.Baths'] = 'baths'
colnames(d)[colnames(d) == 'Number.of.Half.Baths'] = 'half.baths'
colnames(d)[colnames(d) == 'Living.Area'] = 'living'
colnames(d)[colnames(d) == 'SHAPE.Area'] = 'land'
colnames(d)[colnames(d) == 'Condition.Description'] = 'cond'
colnames(d)[colnames(d) == 'State.Use.Description'] = 'use'
colnames(d)[colnames(d) == 'GlobalID'] = 'id'
colnames(d)[colnames(d) == 'Mailing.Address'] = 'address'
colnames(d)[colnames(d) == 'Mailing.City'] = 'city'
colnames(d)[colnames(d) == 'Town.Name'] = 'town'
colnames(d)[colnames(d) == 'Valuation.Year'] = 'val.year'
colnames(d)[colnames(d) == 'Sale.Price'] = 'sale.price'
colnames(d)[colnames(d) == 'Sale.Date'] = 'sale.date'
colnames(d)[colnames(d) == 'AYB'] = 'ayb'
colnames(d)[colnames(d) == 'EYB'] = 'eyb'
d = d %>%
filter(value != 0,
grepl('Single Fam|SINGLE FAM|One Fam|ONE FAM',
use),
Qualified == 'Q',
sale.price != 0,
!is.na(living),
living != 0,
Condition %in% c( 'E', 'EX', ## excellent
'VG', 'G', 'GD', ## very good, good
'A+', 'A', 'AV', 'A-', ## avg+, avg, avg-
'F', 'FR', ## fair
'P', 'VP' ), ## poor, v poor
!duplicated(id)) %>%
mutate(cond = case_when(cond == 'EX' ~ 'Excellent',
cond == 'Avarage' ~ 'Average',
cond == 'AV' ~ 'Average',
cond == 'F' ~ 'Fair',
cond == 'FR' ~ 'Fair',
cond == 'GD' ~ 'Good',
cond == 'G+' ~ 'Very Good',
TRUE ~ cond),
dist = as.numeric(dist),
land = ifelse(land < 0, -land, land)) %>%
select(value, living, land,
beds, baths, half.baths,
cond, use, dist, ayb, eyb,
sale.price, sale.date,
address, city, town, id,
centroid.x, centroid.y,
point.x, point.y)
d = d %>%
filter(town == 'NEW HAVEN')5. Data exploration with ggplotly and plotly
Recall we noticed what looks like two distinct clouds of points in this visualization.
library(ggplot2)
library(plotly)
library(dplyr)
# Create scatter plot with enhanced tooltips
g <- ggplot(d, aes(x = living, y = log(value),
text = paste(
"<b>Town:</b>", town,
"<br><b>Living Area:</b>", living, "sqft",
"<br><b>Land Size:</b>", land, "sqft",
"<br><b>Bedrooms:</b>", beds,
"<br><b>Bathrooms:</b>", baths,
"<br><b>Use:</b>", use,
"<br><b>Address:</b>", address,
"<br><b>Sale Price:</b>", sale.price,
"<br><b>Property Value:</b>", value
))) +
geom_point(color = "navy", alpha = 0.7) + # Set point color to navy
labs(title = "Living Area vs Log(Property Value)",
x = "Living Area (sqft)", y = "Log(Property Value)") +
theme_minimal()
# Convert to interactive plot with plotly
ggplotly(g, tooltip = "text")Use ggplotly or plotly to make a interactive scatter plot of log(value) vs living. Add color, tooltip, or other features that help you investigate why there appears to be two distinct clouds of points. What characteristics do the properties in the upper right cluster of points have in common?
In the upper right cluster of points, they are all more bedrooms and more expensive. Additionally, all the top right cluster addresses are closer to Yale’s campus based off of the addresses in the tooltip, and the addresses on the left-hand bottom part of the graph are streets further away from Yale’s campus. This sort of clustering is super interesting but makes intuitive sense.
6. Data exploration with leaflet
Use leaflet to make an interactive map of the properties. Add color, tooltips, or other features that helps you further investigate why there are two clouds of points. Does this map confirm your conclusions from the previous question?
# Load leaflet library
library(leaflet)
# Define color palette based on property sale price
palette <- colorNumeric(palette = "viridis", domain = d$sale.price)
# Create leaflet map
m <- leaflet(d) %>%
addTiles() %>% # Add basemap
addCircleMarkers(
lng = ~point.x, lat = ~point.y,
radius = 4, # Marker size
color = ~palette(sale.price), # Color by sale price
fillOpacity = 0.7,
popup = ~paste("<strong>Address:</strong>", address,
"<br><strong>Living Area:</strong>", living,
"<br><strong>Land Size:</strong>", land,
"<br><strong>Beds:</strong>", beds,
"<br><strong>Baths:</strong>", baths,
"<br><strong>Sale Price:</strong>", sale.price,
"<br><strong>Year Built:</strong>", ayb)
) %>%
addLegend("bottomright", pal = palette, values = d$sale.price,
title = "Property Value", opacity = 1)
# Print map
mThis does support my previous conclusions! The area right near Yale appear to have more green-tinted dots, while the large majority of the areas around are purple. The tooltip in the Leaflet map includes key property details such as address, living area, land size, number of beds and baths, sale price, and year built, providing a quick summary of each property. The color of the markers reflects the property’s sale price, using a continuous gradient from the “viridis” palette to visually distinguish higher and lower values. Additionally, the legend in the bottom right corner dynamically updates to display the range of property values, helping users interpret the map more easily.